:1 Price
:2 Description provided by the critic.
library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------------------------------------------------------------------ tidyverse 1.3.0 --
## v ggplot2 3.2.1 v purrr 0.3.3
## v tibble 2.1.3 v dplyr 0.8.3
## v tidyr 1.0.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## -- Conflicts --------------------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(DataComputing)
## Registered S3 method overwritten by 'mosaic':
## method from
## fortify.SpatialPolygonsDataFrame ggplot2
library(leaflet)
library(tidytext)
## Warning: package 'tidytext' was built under R version 3.6.3
options(warn=-1)
-I would like to do an informative research on wine price which include appreciation and depreciation. What factor drives wine price of the wine up and down? What makes wine A more expensive that wine B?
-The data is from Kaggle.
-The data was collected by winemag a wine tasting magazine.
-Each case represents each wine.
-There are over 130 thousand cases.
-Wine region wine score wine price and variant
Wine<- read.csv(file = 'winemag-data-130k-v2.csv',header = TRUE, stringsAsFactors = FALSE)
Wine
Prepare the data for further step
Find the average price of wine
Remove all the NA
wine2<-na.omit(Wine)
avg<-
wine2%>%
group_by(country)%>%
summarise (avg = mean(price),Score =mean(points))
avg
ggplot(data=avg,aes(x=avg,y=Score, ymin=80))+geom_point()
finding 95 percentile of the wine price
Seems to have a corelation between price and points on the average of each country
wine2 %>%
group_by(country) %>%
summarise(quantile = scales::percent(c(0.95)),
price = quantile(price, c(0.95)),score = quantile(points, c(0.95)))
Finding bottom 5
wine2 %>%
group_by(country) %>%
summarise(quantile = scales::percent(c(0.05)),
price = quantile(price, c(0.05)),score= quantile(points, c(0.05)))
As we can see that the scale of the wine is not 0-100 points but start approximately 80
So when we create a graph for the wine we must set the scale to 80-100
p95<-
wine2 %>% filter(points > quantile(points, 0.95))
p05<-
wine2 %>% filter(points < quantile(points, 0.05))
warn<-ggplot(data=p95,aes(x=price,y=points,ymin=96, ymax=100))+geom_point()+facet_wrap(~country,ncol=4) + stat_smooth(method="auto")
suppressWarnings(print(warn))
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
warn<-ggplot(data=p05,aes(x=price,y=points,ymin=80, ymax=100))+geom_point()+facet_wrap(~country,ncol=4)+ stat_smooth(method="auto")
suppressWarnings(print(warn))
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
There is no linear corelation with the price and point but the average of top 5 and bottom 5 vary by between 20 to 70 usd so the higher point wine
Focus on the American wine Because
1. American wine can be found easily here in the United States
2. Can easily compare because in an area for example NAPA valley there is a production of both cheap and expensive wine
3. Good split between the top 5 and bottom 5 2793 and 2784
filtered95 <- p95 %>%
filter(country == "US")
tokenized_comments95 <- filtered95 %>%
select(description, designation, points, price, province, region_1, variety, winery) %>%
unnest_tokens(word, description) %>%
anti_join(stop_words) %>%
filter(word != "wine") %>%
filter(province != "America")%>%
group_by(province, word) %>%
tally()
## Joining, by = "word"
tokenized_comments95 %>% glimpse()
## Observations: 10,894
## Variables: 3
## Groups: province [4]
## $ province <chr> "California", "California", "California", "California", "C...
## $ word <chr> "02", "03", "04", "05", "06", "064", "07", "08", "09", "1"...
## $ n <int> 1, 1, 2, 13, 9, 1, 7, 3, 9, 3, 3, 1, 1, 1, 1, 1, 3, 1, 1, ...
tokenized_comments95 %>%
group_by(province) %>%
top_n(15) %>%
arrange(desc(n)) %>%
ggplot(aes(x = reorder(word, n), y = n, fill = factor(province))) +
geom_bar(stat = "identity") +
theme(legend.position = "none") +
facet_wrap(~ province, scales = "free") +
coord_flip() +
labs(x = "Frequency",
y = "Top words",
title = "Top 5 Percent US wine",
subtitle = "")
## Selecting by n
filtered05 <- p05 %>%
filter(country == "US")
tokenized_comments05 <- filtered05 %>%
select(description, designation, points, price, province, region_1, variety, winery) %>%
unnest_tokens(word, description) %>%
anti_join(stop_words) %>%
filter(word != "wine") %>%
filter(province != "America")%>%
group_by(province, word) %>%
tally()
## Joining, by = "word"
tokenized_comments05 %>% glimpse()
## Observations: 8,123
## Variables: 3
## Groups: province [21]
## $ province <chr> "Arizona", "Arizona", "Arizona", "Arizona", "Arizona", "Ar...
## $ word <chr> "appealing", "apples", "aromas", "berry", "blanc", "blend"...
## $ n <int> 1, 1, 3, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
tokenized_comments05 %>%
group_by(province) %>%
top_n(15) %>%
arrange(desc(n)) %>%
ggplot(aes(x = reorder(word, n), y = n, fill = factor(province))) +
geom_bar(stat = "identity") +
theme(legend.position = "none") +
facet_wrap(~ province, scales = "free") +
coord_flip() +
labs(x = "Frequency",
y = "Top words",
title = "Bottom 5 Percent US wine",
subtitle = "")
## Selecting by n
As we can see the top 5 percent of wine from the United States are from the 4 states(California, New York, Oregon, and Washington)
latlong<- read.csv(file = 'statelatlong.csv',header = TRUE, stringsAsFactors = FALSE)
latlong
Start with Bottom 5 Wine
wineUS05<-
tokenized_comments05 %>%
inner_join(latlong,by = c("province" = "City"))
wineUS051<- wineUS05%>%
group_by(province)%>%
summarise(max(n),)
wineUS05 <-wineUS051%>%
left_join(wineUS05, by = c('province'='province', 'max(n)'='n'))
wineUS05<-
select(wineUS05,-c(State))
# Show first 20 rows from the `quakes` dataset
leaflet(data = wineUS05) %>% addTiles() %>%
addMarkers(~Longitude, ~Latitude, popup = ~word, label = ~word)
Top 5 wine
wineUS95<-
tokenized_comments95 %>%
inner_join(latlong,by = c("province" = "City"))
wineUS951<- wineUS95%>%
group_by(province)%>%
summarise(max(n),)
wineUS95 <-wineUS951%>%
left_join(wineUS95, by = c('province'='province', 'max(n)'='n'))
wineUS95<-
select(wineUS95,-c(State))
# Show first 20 rows from the `quakes` dataset
leaflet(data = wineUS95) %>% addTiles() %>%
addMarkers(~Longitude, ~Latitude, popup = ~word, label = ~word)
Seems like for even the bottom 5 the word is very similar to the top 5 but you can see that some word such as “faint, kicking, sour” is still appear in the top 5 but less so than the bottom 5
wineUS05 %>%
pivot_wider(names_from = word, values_from = 'max(n)')
There are 42 different words from the Bottom 5 and most of it are NA due to the fact that most of the states only make one wine and it wasn't rated highly
The word that appear the most is flavors and palate
wineUS95 %>%
pivot_wider(names_from = word, values_from = 'max(n)')
The top 5 wine is also have a lot of flavors and palate but also the 'fruit' which is absent in the bottom 5 I suspect that bottom 5 wine do have decent taste but lack the fruit flavor
filtered05 %>%
filter(grepl('faint|kicking|sour', description, ignore.case = TRUE))
210 out of 2793 bottom wine contains which is 7.5 percent
filtered95 %>%
filter(grepl(' faint| kicking| sour', description, ignore.case = TRUE))
140 out of 2784 top wine contains which is 5 percent
filtered05 %>%
filter(grepl('fruit', description, ignore.case = TRUE))
728 out of 2793 wine contain which is 26 percent
filtered95 %>%
filter(grepl('fruit', description, ignore.case = TRUE))
1140 out of 2784 wine contain which is 41 percent
We then try to compare the 2 word to word
compare_token <-tokenized_comments05%>%
inner_join(tokenized_comments95, by = c('province'='province', 'word'='word'))%>%
mutate(difference = (n.y/n.x))%>%
arrange(desc(difference))
compare_token
The word such as develop, refined, perfect, delicius, elegance can be found much more promenently in the top wine
compare_token <-tokenized_comments95%>%
inner_join(tokenized_comments05, by = c('province'='province', 'word'='word'))%>%
mutate(difference = (n.y/n.x))%>%
arrange(desc(difference))
compare_token
The word such as Thin, simple, un ripe, dull, harsh can be found much more promenently in the bottom wine
filtered05%>%
summarise (avg = mean(price),Score =mean(points))
filtered95%>%
summarise (avg = mean(price),Score =mean(points))
Price difference of 55 usd and the score difference of 12
With the data from step 3 it seems to suggest that
1.Wine with top wine with high score tends to have a description that is more positive such as develop, refined, perfect, delicius, elegance.
Wine with lower quality such as the bottom 5 tend to have a very neutral to negative description Thin, simple, un ripe, dull, harsh
The word can also be on either the bottom or top so it's not a sure thing
2. Wine that are more expensive tend to score higher than the cheaper counter part (this could imply that the price affect the score because people might bias toward more expensive comodity) The price is not the sure indication of the quality as you can see that some cheap wine can score very highly and vise versa.